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CONVERGENCE PROPERTIES OF THE RANDOMIZED 
EXTENDED GAUSS-SEIDEL AND KACZMARZ METHODS 
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Abstract. The Kaczmarz and Gauss-Seidel methods both solve a linear system X{3 = y by 
iteratively refining the solution estimate. Recent interest in these methods has been sparked by a 
proof of Strohmer and Vershynin which shows the randomized Kaczmarz method converges linearly 
in expectation to the solution. Lewis and Leventhal then proved a similar result for the randomized 
Gauss-Seidel algorithm. However, the behavior of both methods depends heavily on whether the 
system is under or over deter mined, and whether it is consistent or not. Here we provide a unified 
theory of both methods, their variants for these different settings, and draw connections between 
both approaches. In doing so, we also provide a proof that an extended version of randomized Gauss- 
Seidel converges linearly to the least norm solution in the underdetermined case (where the usual 
randomized Gauss Seidel fails to converge). We detail analytically and empirically the convergence 
properties of both methods and their extended variants in all possible system settings. With this 
result, a complete and rigorous theory of both methods is furnished. 


1. Introduction. We consider solving a linear system of equations 

Xf3 = y , (1.1) 

for a (real or complex) m x n matrix X , in various problem settings. Recent interest 
in the topic was reignited when Strohmer and Vershynin [26] proved the linear 1 con¬ 
vergence rate of the Randomized Kaczmarz (RK) algorithm that works on the rows 
of X (data points). Following that, Leventhal and Lewis [15] proved the linear con¬ 
vergence of a Randomized Gauss-Seidel (RGS), i.e. Randomized Coordinate Descent, 
algorithm that works on the columns of X (features). 

When the system of equations is inconsistent (i.e. has no exact solution), as is 
typically the case when m > n in real-world overconstrained systems, RK is known 
to not converge to the ordinary least squares solution 

(3ls :=argnuni||y-X/3||| (1.2) 

as studied by Needed [18]. Zouzias and Freris [30] extended the RK method with the 
modified Randomized Extended Kaczmarz (REK) algorithm, which linearly converges 
to f3 ls- Interestingly, in this setting, we will argue in Section 3.3 that RGS does 
converge to /3ls without any special extensions. 

1.1. Motivation and contribution. The above introduction represents only 
half the story. When m < n, there are fewer constraints than variables, and the 
system has infinitely many solutions. In this case, especially if we have no prior reason 
to believe any additional sparsity in the signal structure, we are often interested in 
finding the least Euclidean norm solution: 

Aljv := argnun ||/3|| 2 s.t. y = Xf3. (1.3) 

While RGS converges to (3ls hi the overcomplete setting, we shall argue in Section 
3.3 that in the undercomplete setting it does not converge to (3ln- We will also argue 
that RK does converge to (3ln without any extensions in this setting. 


1 Mathematicians often refer to linear convergence as exponential convergence. 
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The main contribution of our paper is to provide a unified theory of these related 
iterative methods. We will also construct an extension to RGS that parallels REK, 
which unlike RGS does converge to (3ln (just as REK, unlike RK, converges to (3ls )• 
Some desired properties for this algorithm include that it should also converge linearly, 
not require much extra computation, and work well in simulations. We shall see that 
our Randomized Extended Gauss-Seidcl (REGS) method does indeed possess these 
desired properties. A summary of this unified theory is provided in Table 1.1. 


Method 

Overconstrained, 
consistent : 
convergence to /3*? 

Overconstrained, 
inconsistent : 
convergence to /3 ls? 

Underconstrained : 
convergence to (3ln ? 

RK 

Yes [26] 

No [18, Thm. 2.1] 

Yes (Sec. 3.3) 

REK 

Yes [30] 

Yes [30] 

Yes (Sec. 3.3) 

RGS 

Yes [15] 

Yes [15] 

No (Sec. 3.3) 

REGS 

Yes (Remark 1) 

Yes (Sec. 4.3) 

Yes (Thm. 4.1) 


Table 1.1 


Summary of convergence properties for the overdetermined and consistent setting, overdeter¬ 
mined and inconsistent setting, and underdetermined settings. We write (3 * to denote the solution 
to (1.1) in the overdetermined consistent setting, with (3 ls an d Pln being defined in (1.2) and 
(1.3) for the other two settings. 


1.2. Paper Outline. In Section 2 we recap the three main existing algorithms 
mentioned in the introduction (RK, RGS, REK). We discuss the performance of these 
algorithms in the three natural settings described in Table 1.1 in Section 3. Section 4 
introduces our proposed algorithm (REGS) and proves its linear convergence to the 
least norm solution, completing the theoretical framework. Lastly, we end with some 
simulation experiments in Section 5 to demonstrate the tightness and usefulness of 
our theory, and conclude in Section 6. 

2. Existing Algorithms and Related Work. In this section, we will sum¬ 
marize the algorithms mentioned in the introduction, i.e. RK, RGS and REK. We 
will describe their iterative update rules and mention their convergence guarantees, 
leaving the details of convergence to the next section. Throughout the paper we will 
use the notation X 1 to represent the zth row of X (or *th entry in the case of a vector) 
and X f :j ) to denote the j th column of a matrix X. We will write the estimation (3 as a 
column vector. We write vectors and matrices in boldface, and constants in standard 
font. 

2.1. Randomized Kaczmarz (RK). The Kaczmarz method was first intro¬ 
duced in the notable work of Kaczmarz [14]. It has gained recent interest in tomog¬ 
raphy research where it is known as the Algebraic Reconstruction Technique (ART) 
[8, 17, 1, 13]. Although in its original form the method selects rows in a deterministic 
fashion (often simply cyclically), it has been well observed that a random selection 
scheme reduces the possibility of a poor choice of row ordering [9, 12]. Earlier con¬ 
vergence analysis of the randomized variant were obtained (e.g. [29]), but yielded 
bounds with expressions that were difficult to evaluate. Strohmer and Vershynin [26] 
showed that the RK method described above has an expected linear convergence rate 
to the solution (3* of (1.1), and are the first to provide an explicit convergence rate 
in expectation which depends only on the geometric properties of the system. This 
work was extended by Needell [18] to the inconsistent case, analyzed almost surely 
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by Chen and Powell [3], accelerated in several ways [7, 6, 24, 21, 20], and extended to 
more general settings [15, 25, 19]. 

We describe here the randomized variant of the Kaczmarz method put forth by 
Strohmer and Vershynin [26]. Taking X,y as input and starting from an arbitrary 
initial estimate for (3 (for example /3 0 = 0), RK repeats the following in each iteration. 
First, a random row i E {l,...,m} is selected with probability proportional to its 
Euclidean norm, i.e. 

Pr(row = !) = W||, 

where ||X||f denotes the Frobenius norm of X. Then, project the current iterate 
onto that row, i.e. 

ft+i : =ft + fa '| | (2.1) 

where here and throughout X* denotes the (conjugate) transpose of X. 

Intuitively, this update can be seen as greedily satisfying the ith equation in the 
linear system. Indeed, it is easy to see that after the update, 

X l (3 t+l =y\ (2.2) 

Referring to (1.2) and defining 

m 

L(f3) = ±|| y- X(3\\ 2 = ± - X‘(3) 2 , 

i=i 


we can alternatively interpret this update as stochastic gradient descent (choosing a 
random data-point on which to update), where the step size is the inverse Lipschitz 
constant of the stochastic gradient 

S7 2 \{y i -X i (3f = \\X i \\l 

2.2. Randomized Extended Kaczmarz (REK). For inconsistent systems, 
the RK method does not converge to the least-squares solution as one might desire. 
This fact is clear since the method at each iteration projects completely onto a se¬ 
lected solution space, being unable to break the so-called convergence horizon. One 
approach to overcome this is to use relaxation parameters, so that the estimates are 
not projected completely onto the subspace at each iteration [28, 27, 2, 10]. Re¬ 
cently, Zouzias and Freris [30] proposed a variant of the RK method motivated by the 
work of Popa [23] which instead includes a random projection to iteratively reduce 
the component of y which is orthogonal to the range of X. This method, named 
Randomized Extended Kaczmarz (REK) can be described by the following iteration 
updates, which can be initialized with /3 0 = 0 and zq = y: 


Pt +1 Pt 


(: v * - 4 - x l P t ) 


lix 1 


(. x l y 


Zt+l = Z t - 


(X(j), Zt) 


\\x 


U) 


- U) ■ 


(2.3) 


Here, a column j S {1,..., n} is also selected at random with probability proportional 
to its Euclidean norm: 


Pr (column = j) = ||x||2 


(2.4) 
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and again X(j) denotes the jth column of X. Here, z t approximates the component 
of y which is orthogonal to the range of X , allowing for the iterates (3 t to converge to 
the true least-squares solution of the system. Zouzias and Freris [30] prove that REK 
converges linearly in expectation to this solution (3ls- 

2.3. Randomized Gauss-Seidel (RGS). Again taking X,y as input and 
starting from an arbitrary /3 0 , the Randomized Gauss-Seidel (RGS) method (or the 
Randomized Coordinate Descent method) repeats the following in each iteration. 
First, a random column j £ {l,...,n} is selected as in (2.4). We then minimize the 
objective L(j3) = ^\\y — X/3||| with respect to this coordinate to get 


fit +1 fit 


X* {j) (y-Xf3 t ) 


\\X 


U) 


Hi) 


(2.5) 


where is the jth coordinate basis column vector (all zeros with a 1 in the jth 
position). It can be seen as greedily minimizing the objective with respect to the jth 
coordinate. Indeed, letting represent X without its jth column and f3 

without its jth coordinate, 

— = -X* a) (y - X(3) = -X* (j) (y - X { _ j)f 3^ - X U) p). (2.6) 

Setting this equal to zero for the coordinate-wise minimization, we get the aforemen¬ 
tioned update (2.5) for /3 J . Alternatively, since [VL(/3)p = —X*^(y — X/3), the 
above update can intuitively be seen as a univariate descent step where the step size 
is the inverse Lipschitz constant of the gradient along the jth coordinate, since the 
(j, j) entry of the Hessian is 

[V 2 L(/3)]yj = {X*X)u = ||X 0 .)|| 2 . 


Leventhal and Lewis [15] showed that this algorithm has an expected linear con¬ 
vergence rate. We will detail the convergence properties of this algorithm and the 
others in the next section. 

3. Problem Variations. We first examine the differences in behavior of the 
two algorithms RGS and RK in three distinct but related settings. This will highlight 
the opposite behaviors of these two similar algorithms. 

When the system of equations (1.1) has a unique solution, we represent this by 
(3*. This happens when m > n, and the system is consistent. Assuming that X has 
full column rank, 


/3* = {X*X)~ 1 X*y, 


(3.1) 


and then X/3* = y. 

When (1.1) does not have any consistent solution, we refer to the least-squares 
solution of (1.2) as (3ls- This could happen in the overconstrained case, when m > n. 
Again, assuming that X has full column rank, we have 

fi LS = (X*X)~ 1 X*y, (3.2) 

and we can write r := y — X(3ls as the residual vector. 
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When (1.1) has infinitely many solutions, we call the minimum Euclidean norm 
solution given by (1.3), Pln■ This could happen in the underconstrained case, when 
m < n. Assuming that X has full row rank, we have 


Pln = X*{XX*)~ x y. 


(3-3) 


In the above notation, the LS stands for Least Squares and LN for Least Norm. We 


shall return to each of these three situations in that order in future sections. 

One of our main contributions is to achieve a unified understanding of the behavior 
of RK and RGS in these different situations. The literature for RK deals mainly with 
the first two settings only (see [26], [18], [30]). In the third setting, one readily obtains 
convergence to an arbitrary solution (see e.g. (3) of [16]), but the convergence to the 
least norm solution is not often studied (likely for practical reasons). The literature 
for RGS typically focuses on more general setups than our specific quadratic least 
squares loss function L(/3) (see Nesterov [22] or Richtarik and Takac [25]). However, 
for both the purposes of completeness, and for a more thorough understanding of 
the relationship between RK and RGS, it turns out to be crucial to analyze all three 
settings (for equations (1.1)-(1.3)). 

1. When f3* is a unique consistent solution, we present proofs of the linear 
convergence of both algorithms - the results are known from papers by [26] 
and [15] but are presented here in a novel manner so that their relationship 
becomes clearer and direct comparison is easily possible. 

2. When /3 ls is the (inconsistent) least squares solution, we show why RGS 
iterates converge linearly to Pls, but RK iterates do not - making RGS 
preferable. These facts are not hard to see, but we make it more intuitively 
and mathematically clear why this should be the case. 

3. When (3ln is the minimum norm consistent solution, we explain why RK 
converges linearly to it, but RGS iterates do not (both so far seemingly un¬ 
documented observations) - making RK preferable. 

Together, the above three points complete the picture (with solid accompanying 
intuition) of the opposing behavior of RK and RGS. Later, we will present our variant 
of the RGS method, the Randomized Extended Gauss-Seidel (REGS), and compare 
with the corresponding variant of RK (REK). This new analysis will complete the 
unified framework for these methods. 

3.1. Overconstrained System, Consistent. Here we will assume that m > n, 
X has full column rank, and the system is consistent, so y = X/3*. First, let us write 
the updates used by both algorithms in a revealing fashion. If RK and RGS select 
row i and column j at step t + 1, and e* (resp. e( 7 ) ) is the ith coordinate basis row 
(resp. column) vector, then the updates can be rewritten as: 


(RGS) 


(RK) 



(3.4) 


(3.5) 
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where r t = y — X/3 t = Xf3* — X(3 t is the residual vector at iteration t. Then 
multiplying both equations by X gives 


(RK) Xf3 t+1 := Xf3 t + ** Pt) X(Xy (3.6) 

II-* II2 

(RGS) Xf3 t+1 := X(3 t + 0 v X {j) . (3.7) 

11-* U) II 2 

We now come to an important difference, which is the key update equation for 
RK and RGS. 

First, from the update (3.4) for RK, we have that (3 t+1 — (3 t is parallel to X 1 . 
Also, (3 t+ i — f3* is orthogonal to X 1 (since X*(/3 t+1 — (3*) = y l — y l = 0). Then by 
the Pythagorean theorem, 

\\flt+i - P*\\l = II Pt - P*\\l ~ \\@t+i - Ml (3.8) 

Note that from the update (3.7), we have that Xj3 t+1 —X(3 t is parallel to X^y Also, 
X/3 t+1 —X(3* is orthogonal to X U) (since X* {j) {X/3 t+1 -X/3*) = X* {j) (X/3 t+1 -y) = 
0 by the optimality condition dL/df3 J = 0). Then again by the Pythagorean theorem, 

\\X(3 t+1 - X/3*||! = \\Xf3 t - X(3*\\l - \\X(3 t+l - Xp t \&. (3.9) 

The rest of the proof follows by simply substituting for the last term in the above 
two equations, and is presented in the following table for easy comparison. Note 
£ = X*X is the full-rank covariance matrix and we first take expectations with 
respect to the randomness at the (t+ l)st step, conditioning on all randomness up to 
the tth step. We later iterate this expectation. 



Here, A m i n (£)||/3 t —/3*||| < \\X((3 t — /3*) ||| i.e. A m ; n (£) is the smallest eigenvalue 
of £ (singular value of X). It follows that 

(RK) ®M-/3*||1< \\Po-P*\\t (3-10) 

(RGS) E\\(3 t - (3* HI < (l - ||/3 0 - (3*\&, 

where ||u>|||j = w*'Ew = ||JXxt?||| is the norm induced by £. Since £ is invertible when 
m > n and X has full column rank, the last equation also implies linear convergence 


6 















of E||/3 t — /3*|||. The final results exist in Strohmer and Vershynin [26], Leventhal and 
Lewis [15] but there is utility in seeing the two proofs in a form that differs from their 
original presentation, side by side. In this setting, both RK and RGS are essentially 
equivalent (without computational considerations). 

3.2. Overconstrained System, Inconsistent. Here, we will assume that m > 
n, X is full column rank, and the system is inconsistent, so y = X(3ls + r, where r 
is such that X*r = 0. It is easy to see this condition, because as mentioned earlier, 

Pls = {X* X)~ 1 X*y, 


implying that X*X/3ls = X*y. Substituting y = X/3 ls + r gives that X*r = 0. 

In this setting, RK is known to not converge to the least squares solution, as 
is easily verified experimentally and geometrically. The tightest convergence upper 
bounds known are by [18] and [30] who show that 


E\\/3 t -(3 LS \\l< 1 — 


= 1 — 


A m in(5j) 
inn(*)\ 

11*111 J 


IIA) ~ Pls\\1 + 


WPo-PlsWI 


Amin(S) 


h 2 „i „(*)’ 


where we write <T m i n (-X') to denote the smallest (non-zero) singular value of X and 
again ||X||i? its Frobenius norm. Attempting the previous proof, (3.8) no longer holds 
- the Pythagorean theorem fails because (3 t+1 — /3 ls is no longer orthogonal to X 1 
since X ? (/3 t+1 — (3ls) = y l —X 1 /3ls ^ 0. Intuitively, the reason RK does not converge 
is that every update of RK (say of row i) is a projection onto the “wrong” hyperplane 
that has constant y l (where the “right” hyperplane would involve projecting onto a 
parallel hyperplane with constant y l — r 1 where r was defined above). An alternative 
intuition is that all RK updates are in the span of the rows, but /3ls is not in the 
row span. These intuitive explanations are easily confirmed by experiments seen in 
[30, pp. 787-788],[18, pp. 402], Zouzias and Freris [30] alleviate this issue with the 
REK algorithm, whose convergence obeys 

Ellft - Ml < (l - l " 2J (l + . (3.11) 

However, the fate of RK doesn’t hold for RGS. Almost magically, in the previous 
proof, the Pythagorean theorem still holds in equation (3.9) because 


X* {j) (Xf3 t+1 - X(3 ls ) = X*^ j) (X(3 t+1 -y) + X* {j) (y - X(3 LS ) = 0. (3.12) 

The first term is 0 by the optimality condition for /3 t+1 i.e. X(j)(Xf3 t+ 1 y) — 
dL/d(3 3 = 0. The second term is zero by the global optimality of /3is, i.e. X*(y — 
X(3ls) = VL = 0. Also, £ is full rank as before. Indeed, RGS works in the space of 
fitted values X/3 and not the iterates (3. 

In summary, RK does not converge to the LS solution, but RGS does at the same 
linear rate. This is what motivated the development of Randomized Extended Kacz- 
marz (REK) by Zouzias and Freris [30] which, as discussed earlier, is a modification 
of RK designed to converge to (3ls by randomly projecting out r. An independent 
paper by Dumitrescu [5] argues however that in this setting RGS is preferable to REK 
in terms of computational convergence. 


7 








3.3. Underconstrained System, Infinite Solutions. Here, we will assume 
that m < n, X is full row rank and the system is consistent with infinitely many 
solutions. As mentioned earlier, it is easy to show that 

Aljv = X*(XX*) -1 y 

(which clearly satisfies X/3ln = y )■ Every other consistent solution can be expressed 
as 


(3 = (3 ln + z where Xz = 0. 

Clearly any such (3 would also satisfy Xf3 = X(3ln = V- Since Xz = 0, z L (3 ln 
implying ||/3|| 2 = ||/3 _ljv|| 2 + || 2 : || 2 , showing that (3ln is indeed the minimum norm 
solution as claimed. 

In this case, RK has good behavior, and starting from (3 0 = 0, it does converge 
linearly to (3ln- Intuitively, /3ln = X*a (for a = (XX*) -1 !/) and hence is in the 
row span of X. Starting from (3 0 = 0, RK only adds multiples of rows to its iterates, 
and hence will never have any component orthogonal to the row span of X. There is 
exactly one solution with no component orthogonal to the row span of X 1 and that 
is Pln, and hence RK converges linearly to the required point, where the rate can 
be bounded in exactly the same way as (3.10). It is important not to start from an 
arbitrary /3 0 since the RK updates can never eliminate any component of /3 0 that is 
perpendicular to the row span of X. Of course, the same properties are shared by 
REK for this case as well. It is noted in Zouzias and Freris [30, Sec. 2.1] that the REK 
converges at the same rate for underdetermined systems as it does overdetermined 
systems. 

Mathematically, the previous earlier proof works because the Pythagorean theo¬ 
rem holds since it is a consistent system. Now, £ is not full rank but note that since 
both /3ln and /3 t are in the row span, f3 t — (3 ln has no component orthogonal to 
X (unless it equals zero in which case the algorithm has already converged). Hence 
A m in(S)||/3 t - (3ln\\ 2 < \\X(/3 t — (3ln)\\ 2 holds, where A m i n (£) is now understood to 
be the smallest positive eigenvalue of £. To summarize, the exact same bound (3.10) 
still holds in this case, with the appropriate understanding of A m i n (£) and under the 
assumption that the initialized (3 0 is in the row span of X. 

RGS unfortunately suffers the opposite fate. The iterates do not converge to 
/ 3ln , even though X(3 t does converge to X(3ln- Mathematically, the convergence 
proof still carries forward as before, but in the last step when X*X cannot be inverted 
because it is not full rank. Hence we get convergence of the residual to zero, without 
getting convergence of the iterates to the least norm solution. Intuitively, the iterates 
of RGS add components to the estimates that are orthogonal to the row span of X. 
These components are never eliminated because in minimizing the residual, they are 
ignored. Therefore, RGS is able to minimize the residual without finding the least 
norm solution. 

Unfortunately, when each update is cheaper for RK than RGS (due to matrix 
size), RGS is preferred for reasons of convergence and when it is cheaper for RGS 
than RK, RK is preferred. 

4. REGS. We next introduce an extension of RGS, analogous to the extension 
REK of RK. The purpose of extending RK was to allow for convergence to the least 
squares solution. Now, the purpose of extending RGS is to allow for convergence to 
the least norm solution. We view this method as a completion to the unified analysis 
of these approaches, and it may also possess advantages in its own right. 


4.1. The algorithm. Consider the linear system (1.1) with m < n. Let Pln 
denote the least norm solution of the underdetermined system as described in (1.3). 
The REGS algorithm is described by the following pseudo-code. Analogous to the 
role 2 plays in REK, z iteratively approximates the component in (3 orthogonal to 
the row-span of X. By iteratively removing this component, we converge to the least 
norm solution. Note that outputting f3 t instead of (3[ N = (3 t — z t in Algorithm 1 
recovers the RGS algorithm. This may be preferable in the overdetermined setting. 


Algorithm 1 Randomized Extended Gauss-Seidel (REGS) 


1: 

2 : 

3: 

4: 

5: 

6 : 

7: 

8 : 

9: 

10 : 

11 : 

12 : 

13: 


procedure (X, y , maxlter) > m x n matrix X , y £ C m , maximum iterations T 
Initialize (3 0 = 0, Zq = 0 
for t = 1, 2,..., T do 

II X 11^ 

Choose column X^ with probability 
Choose row X 1 with probability 


Set 7 t 
Set f3 t 


X* a) (X/3 t _ x ~y) 

POT 

Pt -1 + It 


e (i) 


(X*')* x^ 

Set P[ = Id n p^-jp > Id n denotes the n x n identity matrix 

Update z t = P l (z t -i + 7 t ) 

Update {3 ! t ,V = P t — z t 

end for 
Output Pt N 

end procedure 


4.2. Main result. Our main result for the REGS method shows linear conver¬ 
gence to the least norm solution. 

Theorem 4.1. The REGS algorithm outputs an estimate P^ N such that 

E\\Pt N - PlnWI <a T \\p LN \\l + 2a^-^- (4.1) 

1 — a 

where B = ^Yxrf 2 and a ={ l ~ Tfjf 1 ) ■ 

Proof: We devote the remainder of this section to the proof of Theorem 4.1. 

Let Et_i denote the expected value conditional on the first t—1 iterations, and in¬ 
state the notation of the theorem. That is, E t _i[-] = E[-|ii, ji, * 2 , where 

* t » is the t* th row chosen and jt * is the t* th column chosen. We denote conditional ex¬ 
pectation with respect to the choice of column as Ej_ x [-] = E[- | i\,j\, ...i t ~i, jt-i,it\- 
Similarly, we denote conditional expectation with respect to the choice of row as 

E|_-j_[•] = E[- | Then note by the law of total expectation we 

have that E t _i[-] = E|_ x [E^_ 1 [-]]. We will use the following elementary facts and 
lemmas. 

Fact 1. ([30, Fact 3]) For any Pi as in the algorithm, E||Pjic III < a\\w\\% for 
any w. 

Lemma 4.2. ([15, Thm. 3.6]) We have that 

Et-i\\xp t - XPlnWI < aWXp^ - Xp LN \\l 
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and that 


E\\Xf3 t - X0 LN \\ 2 < a t \\Xf3 0 - X(3 LN \\ 2 2 . 

Now we first consider \\f3^ N — 0ln\\2 : 

\\rf N -P LN \\l=\\P t -z t -l3 LN \\l 

= \\0 t — Pi(z t - 1 + 7 1 ) — PiPLN — ( Id n - Pi)f3 LN \\ 2 2 
= II Pt ~ p i(zt -1 + P t - P t -i) - PiPln - (Jd n - -Pi)/3ijv||| 

= ||(Jd„ - Pi)0 t + Piffit^ - z t - 1 ) - - (. Id n - P,)/3iiv||2 

= || {Id n ~ Pi)/3 t + PiPt-1 - PiPLN - {Idn - PJPlnWI 
= IIPiGaf^i - /3iiv) + (/d„ - Pi)(/3 t - Pln)\\1 
= II P t {pf-1 - Pln )111 + ||(/d„ - P»)(/3 t - /3 -C.jv)111- (4.2) 


So far, we have only used substitution of variables as defined for the algorithm and 
that (3ln = Pi^LN + {Id n — Pi)f3LN is an orthogonal decomposition. We first focus 
on the expected value of the second term. 

Lemma 4.3. We also have that 


E t _ 1 ||(Id n -P0(/3 t -/3iJv)||2 < 


a\\X{f3 t _ 1 -(3 LN )\\l 


\\X 


Proof: 


E t _ 1 ||(Id n -P i )(/3 t -/3 iJV )||l 

= E t _i[(/3 t - p LN )*(Id n - Pi)* {Id n - Pi)(/3 t - /3ljv)] 
= E t _i[(/3 t - (3 LN )*{Id n - Pi){0 t - 0 ln)] 

x. / (*1*** 


= E f _i 
= E t _i 

= 


= Ej_i 


= E?_r 


(/3 t - Pln)* 


11*12 


\X\f3 t -(3 L N)\\l 


1*12 


EU 


\\X i ((3 t — {3 l 


JVj||2 


1*12 


1*1/3* — Pln)\\ 


11*12 


< 


E 

2=1 

1I*(/3 / -/3iav)||2 

\\x\\ 2 f 

a \\X{f3 t -i 111 
11*111 


{Pt - Pln) 


11*12 


11*111 


The first line follows by expanding the norm, the second line since {Id n — Pi) 
is a projection matrix, the third line from the definition of Pi, the fourth line is 
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computation, the fifth line follows from the law of total expectation, the next two 
lines are computation, and finally the last line follows by Lemma 4.2. Notice that in 
the seventh line, E)_ : = E t _i because the random variable (3 t only depends on the 
choice of columns. □ 

We want to control the term r* = E||(Id„ — Pi)(/3 t — (3 ln)\\2 by bounding it by 
some a and B such that r t < o^B. We calculate this here: 

E||(/d„ - P t )((3 t - f3 LN )\\ 2 2 = E[E t _ x | \(Id n - P,)((3 t - 0 LN )\\l] 

„ aE||X(/3 t _ 1 -/3 iJV )||| 

\m 

/ t\\X0 o -Xf3 LN \\l 

- G — we —■ 

The first line follows by definition, the second is by Lemma 4.3, and the third by 
Lemma 4.2. 

Finally, we take the expected value of ||/3f N — 0ln\\2- From equation (4.2) and 
using Fact 1 we obtain: 

E||/3 t LAr - 0 LN \\l = E||P ~ Pln)\\1 + E|| (Id n - Pi)((3 t - (3 LN )f 2 

< aE\\{0t-i~ M\\l+midu - Pi)(0 t -0LN) III- 

We complete the proof using the following lemma from [30]: 

Lemma 4.4. ([30, Thm. 8]) Suppose that for some a, a < 1, the following bounds 
hold for all t* >0: 

Ell/#* - p LN \\l < oEII/#*! - 0 LN \\l + rt* and r t . < a e B. 

Then for any T > 0, 

n\0T N -0 ln\\1 <cx T \\Pq N -PlnWI + (aL T/2J +aLT/2j ) _^_ i 

1 — a 

Letting a = a = a, r[ = E|| (Id n - P,)(/3 t - 0LN)\\h B = l|X/3 °|p^f' I ' Jvl12 , and 

noting that /3 q n = /3 0 = 0, we complete the proof of Theorem 4.1. 

□ 

Remark 1. Here we note that the same proof works for overdetermined systems. 
In particular, this works because Lemma f.2 holds for /3 ls and (3* also (see Thm. 
3.6 in [15]). Also, Lemma ).3 follows for both overdetermined consistent systems (see 
table in Section 3.1) as well as overdetermined inconsistent systems (from (3.12) and 
subsequent arguments). 

4.3. Comparison. Theorem 4.1 shows that, like the RK and REK methods, 
REGS converges linearly to the least-norm solution in the underdetermined case. We 
believe it serves to complement existing analysis and completes the theory of these 
iterative methods in all three cases of interest. For that reason, we compare the three 
approaches for the underdetermined setting here. For ease of comparison, set a as in 
Theorem 4.1, and write k = cr max (X)/cr m i n (X) for the condition number of X. From 
the convergence rate bounds for RK [26] and REK [30] given in Section 3, and after 
applying elementary bounds to (4.1) of Theorem 4.1, we have: 
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(RK) 

n\pt-pLN\\i 

< 

a* /3liv| 2 

(4.3) 

(REK) 

E||/3 2 t ~ /3lat|| 2 

< 

a f (l + 2k 2 )||/3i,jv| 2 

(4.4) 

(REGS) 

eIIAh - Pln\\1 

< 

a*(l + 2k 2 )||/3/,at| 2 - 

(4.5) 

We find similar results in the overdetermined, 
convergence rate bounds for RGS [15], REK [30], 
given in section 3, we have: 

inconsistent setting. Using the 
and REGS (Theorem 4.1), also 

(RGS) 

E||/3 t ~ Pls\\1 

< 

Oi*\\(3 ls\\2 

(4.6) 

(REK) 

EllAtt-foslll 

< 

a t (l + 2 K 2 )||/3 iS || 2 

(4.7) 

(REGS) 

me*-Ml 

< 

a 4 (l + 2k 2 )||/3z,s 2 - 

(4.8) 


Thus, up to constant terms (which are likely artifacts of the proofs), the bounds 
provide the same convergence rate a, which is not surprising in light of the con¬ 
nections between the methods. In the next section, we compare these approaches 
experimentally. 

5. Empirical Results. In this section we present our experimental results. The 
code used to run these experiments can be found at [4]. For each experiment, we 
initialize a matrix X and vector [3 with independent standard normal entries and run 
50 trials. The right hand side y is taken to be X/3. At each iteration t, we keep track of 
the £ 2 -error \\(3f N — [3ln W 2 and fix the stopping criterion to be \\f3f N — (3 lnWI < 10 -6 
(of course in practice one chooses a more practical criterion). In each plot, the solid 
blue line represents the median £ 2 -error at iteration t, the light blue shaded region 
captures the range of error across trials, and the red line represents the theoretical 
upper bound at each iteration. In Figure 5.1, we show the convergence of j3^ N for 
varying sized underdetermined linear systems. In Figure 5.2, we show the convergence 
of a matrix X of size 700x1000 and its theoretical upper bound. As it turns out, the 
REGS algorithm often converges much faster than the theoretical worst-case bound. 

We also tested REGS on tomography problems using the Regularization toolbox 
by Hansen [11] (http://www. imm.dtu.dk/^pcha/Regutools/). For the 2D tomog¬ 
raphy problem X/3 = y with X an m x n matrix where n = dN 2 and m = TV 2 , we 
use TV = 20 and d = 3 for our experiments. Here, X consists of samples of absorption 
along a random line on an TV x TV grid and d is the oversampling factor. The results 
from this experiment are shown in Figure 5.2. 

We also compare the performance of all four algorithms (RK, REK, RGS, REGS) 
under the different settings discussed in this paper. Each line in each plot represents 
the median £ 2 -error at that iteration or CPU time over 50 trials using a stopping 
criterion of 10~ 6 . For the underdetermined case, X is a 50 x 500 Gaussian matrix 
and a 500 x 50 Gaussian matrix for the overdetermined cases. In the overdetermined, 
inconsistent case, we set y = X/3 + r where r £ null(X*) (computed in Matlab using 
the null () function). Figure 5.3, Figure 5.4, and Figure 5.5 show the empirical results 
for the underdetermined, overdetermined inconsistent, and overdetermined consistent 
cases respectively. Note we only plot the methods which actually converge to the 
desired solution in each case. Looking at iterations to convergence, it seems that RK 
and RGS converge faster than their extended counterparts while REGS and REK 
converge to the desired solution at about the same rate. 
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Figure 5.1. Left: t^-crror (log-scale) of REGS on a 150 x 500 matrix and its the theoretical 
bound. Right: Comparison of £ 2 -error (log-scale) of REGS for m x 500 sized matrices with m = 
50,100,150. 
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Figure 5.2. Left: i 2 -error (log-scale) of REGS on a 700 X 1000 matrix and its the theoretical 
bound. Right: i^-error (log-scale) of REGS on the tomography problem with a 400 x 1200 matrix. 


6. Conclusion. The Kaczmarz and Gauss-Seidel methods operate in two differ¬ 
ent spaces (i.e. row versus column space), but share many parallels. In this paper 
we drew connections between these two methods, highlighting the similarities and 
differences in convergence analysis. The approaches possess conflicting convergence 
properties; RK converges to the desired solution in the underdetermined case but 
not the inconsistent overdetermined setting, while RGS does the exact opposite. The 
extended method REK in the Kaczmarz framework fixes this issue, converging to the 
solution in both scenarios. Here, we present the REGS method, a natural extension 
of RGS, which completes the overall picture. We hope that our unified analysis of all 
four methods will assist researchers working with these approaches. 
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Figure 5.3. Comparison of median -error (log-scale) of RK, REK, and REGS for an under¬ 
determined system. 




Figure 5.4. Comparison of median -error (log-scale) of RGS, REK, and REGS for an 
overdetermined, inconsistent system. 
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